Comparative analysis of CNN filter sizes

Evaluating impact of receptive field in Encoder-Decoder and U-Net models for Lane Detection Segmentation

Susanta Deka, Kalyani Kotti (Advisor: Dr. Cohen)

Invalid Date

Introduction to our Project

  • Self-driving cars rely a lot on computer vision to make sense of the road. One of the most important things they need to do is detect lanes so the car knows where to drive safely.

  • Deep learning models, especially CNNs (Convolutional Neural Networks), are great at recognizing patterns in images. More specifically, they focus on parts of the image for making a decision.

  • That’s where something called the receptive field comes in. If the receptive field is too small or not well-designed, the model might miss important context, like curves or broken lane lines.

  • In this project, we explore how the size and shape of the receptive field impacts the model’s ability to accurately detect lanes using simulated driving scenes from the Carla simulator.

Project Objectives

  • Exploring how receptive fields grow across layers in convolutional neural networks. Using the visualization techniques to see what part of the image each layer focuses on and analyzing the ability to capture lane structure, road edges.
  • Test models on the Carla Lane Detection dataset to see how well they detect lanes in realistic driving scenarios.
  • Measure performance using metrics like IoU, accuracy, precision, recall on lane detection outputs. Comparing predictions visually with ground truth to assess model focus and reasoning.

Dataset Overview

Lane Detection for Carla Driving Simulator

For this project we are using dataset based on the Carla (Car Learning to Act) Driving Simulator, an open-source simulator designed for autonomous driving research. Carla simulates driving scenarios that are applicable for training models for tasks like lane detection, object detection, and semantic segmentation.

This data set consists of total samples: 6408 png files = 3075 train + 3075 train_label + 129 val + 129 val_label images.

Understanding Receptive Fields in CNNs

What is a Receptive Field

The receptive field of a neuron (or unit) in a CNN is the region of the input image that affects the output of that neuron.

If a neuron in a deep layer becomes active), the receptive field tells you which area of the original input image caused it to react.

Why is Receptive Field Important in Vision Tasks?

  • Captures Context: Larger RFs help the network understand spatial relationships, objects, and background context.

  • Design Consideration: Helps decide kernel sizes, number of layers, and strides when building CNN architectures.

  • Segmentation Tasks: Accurate pixel-wise predictions require balancing fine details (small RF) and global context (large RF).

  • Deeper Layers = Larger RF: But also need to manage loss of spatial precision.

Receptive Field Growth in CNNs

[Input Image]

[Conv Layer 1] → small receptive field (e.g., 3×3)

[Conv Layer 2] → medium receptive field (e.g., 7×7)

[Conv Layer 3] → large receptive field (e.g., 15×15)

Visualizing Receptive Fields in CNNs

Techniques for Visualizing Receptive Fields

Method | Explanation Gradient-Based | Computes gradients of output w.r.t. input pixels to see which parts affect activation.

Backpropagation | Tracks how changes in pixel values affect activations through layers.

Occlusion Mapping | Slides a gray patch over parts of the input image and sees how output changes.

Increasing Receptive Fields Across Layers

  • As you go deeper into the CNN, the receptive field increases.

  • Early layers capture edges and textures (small RFs).

  • Deeper layers capture shapes, parts, and full objects (large RFs).

Application to Lane Detection

Optimizing the Receptive Field

  • Kernel Sizes:

Explanation: Kernel size controls the area that each filter covers. A larger kernel size increases the receptive field, allowing the model to capture more context in the image.

Example: Using a 5x5 kernel instead of a 3x3 kernel allows the model to capture broader information, improving its ability to detect lane markings over longer distances.

  • Stride:

Explanation: Stride determines how much the filter moves over the input image. Increasing the stride reduces the spatial resolution of the feature maps but increases the effective receptive field.

Example: By using a stride of 2, instead of 1, we can increase the receptive field while reducing the computation required, which is helpful for detecting lanes over long stretches of road.

  • Dilation:

Explanation: Dilation expands the filter by spacing the elements, which increases the receptive field without increasing the number of parameters. This allows the network to capture larger contextual information.

Example: A dilated 3x3 convolution can help capture more context compared to a standard 3x3 filter, enabling the model to detect lane markings that span across larger sections of the image.

Image Classification vs Semantic Segmentation

  • Classification - assign a label

  • Image classification — single label

  • Cat or Dog

Image Classification vs Semantic Segmentation

  • Semantic Segmentation -> assign a label to each pixel

  • Output -> Mask

Convolution Layer

  • Typical CNN architecture

Convolution Layer

  • Feature Extraction

Convolution Layer - Animation

  • Matrix Dot Product

Convolution Layer - Math

  • Matrix Dot Product

  • Sum of Element-Wise multiplication

  • Downsampling

Transposed Convolution

  • Opposite of Convolution

  • Upsample

  • Expand

CNN Encoder-Decoder Architecture

  • Convolution Layers in Encoder

  • Transposed Convolution Layer in Decoder

U-Net

  • Double Convolution Layers

  • Transposed Convolution Layer in Expanding Path

  • Skip Connections (Concat) — Spatial Context

Model Parameter Count

Model Parameters Difference with UNet
CNN 3x3 3,139,587
UNet 3x3 31,037,763 27,898,176
CNN 5x5 8,713,219
UNet 5x5 81,241,411 72,528,192
CNN 7x7 17,073,667
UNet 7x7 156,546,883 139,473,216

Training Setup

  • PyTorch DataLoaders – Batch training

  • Epoch – 10

  • Loss – PyTorch CrossEntropy

PyTorch Cross Entropy

  • Uses softmax and Negative Log Likelihood internally
  • Raw model outputs (logits): [2.5, 1.2, 0.8]
  • After softmax: [0.65, 0.18, 0.17]
  • If the correct class is index 0, NLL loss = -log(0.65) ≈ 0.43
  • If the correct class is index 2, NLL loss = -log(0.17) ≈ 1.77

Training Time

  • UNet-3


Training Time

  • UNet-7


Training Time

  • CNN-7


Training Time

  • CNN-5

Model Parameter Count

Model Parameters Difference with UNet
CNN 3x3 3,139,587
UNet 3x3 31,037,763 27,898,176
CNN 5x5 8,713,219
UNet 5x5 81,241,411 72,528,192
CNN 7x7 17,073,667
UNet 7x7 156,546,883 139,473,216

Comparison of Memory Usage

Performance of Models

  • IoU

  • Dice Coefficient

Performance of Models

Performance of Models

Top Performers

  • CNN-5

  • UNet-3

Model Parameter Count

Model Parameters Difference with UNet
CNN 3x3 3,139,587
UNet 3x3 31,037,763 27,898,176
CNN 5x5 8,713,219
UNet 5x5 81,241,411 72,528,192
CNN 7x7 17,073,667
UNet 7x7 156,546,883 139,473,216

Predictions - CNN-5

Predictions - UNet-3

Conclusion

  • CNN-5 > UNet-3

  • The UNet-3 performs slightly better than the CNN-5

  • Dice score .885 > .85 and the IoU score of .805 > .755

  • CNN-5 is 75% smaller than UNet-3